WorkServer White Paper

GNP Computers High Availability White Paper


If you would like a hardcopy of this white paper, complete with clear versions of the technical drawings, please send email to that effect to webmaster@gnp.com.


Table of Contents

1. INTRODUCTION
2. OVERVIEW OF THE SYSTEM

2.1 Design Objectives

2.2 System Components

3. COMPONENT CONNECTIVITY

3.1 Application Network

3.2 Maintenance Network

4. SOFTWARE ARCHITECTURE

4.1 Normal Operating Conditions

4.2 Software Layers

4.3 HA and Application Software

4.4 CPU State Transitions

5. COMPONENT FAILURE AND REPLACEMENT

5.1 Disk Failure

5.2 Disk Replacement

5.3 SCSI Switch or RAID Controller Failure

5.4 SCSI Switch or RAID Controller Replacement

5.5 CPU or Boot Disk Failure

5.6 CPU Replacement

5.7 Fan Failure and Replacement

5.8 Software Failure

5.9 Software Replacement and Upgrades

6. CONCLUSION

Appendix A Solstice DiskSuite

Appendix B Redundant Arrays of Independent Disks (RAID)


1. Introduction

[ IMAGE ]

This document describes the hypothetical system called the WorkServer Reference Platform which is illustrated above. This system could be implemented using existing technology developed by GNP Computers for its Telco WorkServer™ product line, a set of modular components for implementing high-availability computer platforms for the telecommunications industry. The system described here, an Internet mail server, is not a product that GNP Computers currently sells, nor is it one that GNP has implemented for any of its customers in precisely the form discussed. Nevertheless, it is representative of the type of platform that can be, and has been, implemented using WorkServer technology together with other off-the-shelf third-party products. The following discussion is designed to present some of the issues that are faced in constructing a high-availability computer platform, and to illustrate how the WorkServer products can be used to address them.

2. Overview of the System

2.1 Design Objectives

The primary goal of the system is to ensure that it reliably provides all of the computing resources needed to support the execution of application software - in the example discussed here, Internet mail server software - which provides a useful service, or set of services, to the system's clients. In the best of all possible worlds, components would never fail, and the system, once configured, would continue to provide the needed resources forever. Unfortunately, components do fail. Ultimately, the responsibility for dealing with failures falls to a human being - for example, to ensure that failed components are repaired or replaced. In the environments for which this system was designed, however, the strategy of relying on a person to detect failures and reconfigure the system to restore lost computing resources is not acceptable, either because a person cannot be expected to take the required action quickly enough to avoid an unacceptable interruption of service, or because it is too expensive to provide the continuous level of staffing needed to respond effectively to such unplanned events.

For this reason, the system is designed to act as a stable platform for application software in the face of component failures, without human intervention. Instead of being equipped with only one set of the necessary components, the system is also equipped with one or more spares for each of these components. For some components, the system uses the spare to completely mask a component failure - that is, make the failure transparent to application software; in other cases, the system can detect a failure and reconfigure itself to replace the failed component with its spare, with minimal effect on the application, and without human intervention.

2.2 System Components

In the specific example discussed here, the system is an Internet mail server platform: it supports the execution of software that implements the SMTP and POP3 protocols for routing and storing Internet messages, and for transferring them to the system's clients, mail users. The computing resources needed to support this application are relatively simple:

[ IMAGE ]

Figure 1 -- Application I/O and Maintenance I/O Topologies

In addition to these components required to support the system's primary function, mail service, other components are needed to support maintenance and control of the system:

3. Component Connectivity

Figure 1 contains two high-level views of the main system components and their connections to each other and to the outside world. The two views relate, respectively, to the two independent communication networks in the system:

  1. The application network, based on the standard Ethernet and SCSI protocols, is used to transfer application data between the currently active CPU and its peripherals and the outside world. It also supports the exchange of status information between two pairs of high-availability (HA) daemons running on the CPUs. Please refer to Section 4.3 for further details of the HA software architecture.

  2. The maintenance network, based on GNP proprietary hardware and software, is used to transport status, alarm and control information between all of the system modules and one or more PSGs.

The PSG is a device that acts as a gateway between the maintenance-network protocol and a standard asynchronous terminal interface. On its terminal interface, the PSG supports a command set that allows an external system to query any system module for maintenance status, to receive module alarm reports, and to change the maintenance state of any module, for example, to power it on or off.

3.1 Application Network

As shown in the upper left of Figure 1, an external Ethernet segment acts as the medium over which the system provides mail service to its clients. The system's two CPU modules are connected to this segment via one of their 100Base-T Ethernet interfaces. The CPUs also have direct Ethernet connectivity to each other via their two remaining Ethernet interfaces, which are used to exchange status information between the HA daemons running on the two CPUs.

Each CPU has access, through a pair of SCSI switches, to a pair of RAID controllers and their associated RAID sets. The SCSI switch has two Fast/Wide SCSI "CPU" interfaces, A and B, and two Fast/Wide SCSI "RAID" interfaces, 1 and 2. Each of the two CPU interfaces on each SCSI switch is connected to one of the Fast/Wide SCSI interfaces on one of the CPUs, and one of the RAID interfaces on the SCSI switch is connected to the Fast/Wide SCSI "host" interface on one of the RAID controllers. The other RAID interface on the SCSI switch is connected to a SCSI terminator, and is not shown in the diagram to avoid clutter.

The SCSI switch has two possible states. In state A1B2, as the name suggests, it connects host interface A to RAID interface 1 and host interface B to RAID interface 2. In Figure 1, with both switches in state A1B2, the effect is to connect the upper CPU to both RAID controllers, and to terminate the SCSI chains on the lower CPU. In the other possible state of the SCSI switches, A2B1, the effect is reversed, that is, the upper CPU has both SCSI chains terminated, and the lower CPU is connected to both RAID controllers. The state of a SCSI switch can be changed by manipulating its front-panel switches, or by sending a maintenance command from a PSG to the SCSI switch over the maintenance bus.
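
The switch's two states amount to a fixed mapping from CPU interfaces to RAID interfaces. The sketch below models that mapping; the state names A1B2 and A2B1 come from the text, but the dictionary representation and function name are illustrative, not part of the actual switch firmware.

```python
# Hypothetical model of the SCSI switch's two states.
SWITCH_STATES = {
    "A1B2": {"A": 1, "B": 2},  # CPU interface A -> RAID interface 1, B -> 2
    "A2B1": {"A": 2, "B": 1},  # CPU interface A -> RAID interface 2, B -> 1
}

def connected_raid_interface(state, cpu_interface):
    """Return the RAID interface reached from a CPU interface in a state."""
    return SWITCH_STATES[state][cpu_interface]
```

With both switches in A1B2, the CPU wired to interface A of each switch reaches both RAID controllers; flipping both switches to A2B1 reverses the roles.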

Each RAID controller is connected to two chains of SCSI disks. The diagram shows twelve disks, housed in four disk carrier modules, with each RAID controller connected to five disks, and two of the disks connected to the boot SCSI chains of the CPUs to act as Solaris boot devices. By adding additional disk carrier modules, the complement of each RAID set can be increased to as many as fourteen standard SCSI disks per controller. The controller supports standard RAID levels 0, 1, 4 and 5. Please refer to Appendix B for further details of RAID technology.

3.2 Maintenance Network

The lower half of Figure 1 shows how the application modules described in the previous section are connected to the maintenance network. In fact, all GNP modules are connected to the maintenance network (including, for example, the system's fans) and most of the features described here also apply to the other modules, but we will concentrate on the modules needed directly to support the application.

Physically, the maintenance network is a free-topology CSMA/CD LAN operating at 78 kbps, that runs through and between the system's midplanes and fan assembly. The fan assembly and the modules described in the previous sections, which are all housed in the system's sub-racks, each contain a dedicated maintenance processor, which is connected to the maintenance network through a set of pins in the midplane. The midplane also supplies electrical power to the maintenance section of each module: this supply, which is provided by the system's Maintenance Modules, is independent of and electrically isolated from the -48 vdc supply used to power the other application sections of each module. Maintenance power is always on, so the maintenance section of a module will begin to operate as soon as it is plugged into the midplane, whether or not the rest of the module - for example, the CPU's motherboard - is also powered on.

The maintenance processor in each module controls the power converters that supply the module's application section: it can turn the converters on and off, and trim the supply voltages. It also monitors the supply voltages and currents, and can raise alarms or shut the module off when pre-set voltage, current or temperature thresholds are crossed. The maintenance processor also controls and responds to the module's front-panel lights and switches, translating user requests into the corresponding maintenance actions, and making alarm conditions visible at the front panel.
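
The threshold behavior just described can be summarized as a two-level check. This is an illustrative sketch only; the threshold levels and return values are assumptions, not the maintenance processor's real interface.

```python
def maintenance_action(value, warn_at, shut_off_at):
    """Two-level threshold check for a monitored quantity
    (voltage, current or temperature)."""
    if value >= shut_off_at:
        return "shut-off"   # module powered off to protect the hardware
    if value >= warn_at:
        return "alarm"      # alarm raised on the maintenance network
    return "ok"
```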

All of the maintenance-processor functions described above can be monitored and controlled via messages exchanged between the maintenance processor and a PSG, using the maintenance network. The maintenance-network transport protocol provides an acknowledged datagram service that is used to transfer maintenance data - control and status information - between modules and PSGs. The PSGs, in turn, convert between this protocol and a command-line interface on their asynchronous serial ports.
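
An acknowledged datagram service of this kind is typically built on a send-and-retry loop. The sketch below illustrates the idea only; the retry count, the `AckTimeout` name, and the transport stand-in are assumptions, as the white paper does not specify the protocol internals.

```python
class AckTimeout(Exception):
    """No acknowledgement arrived within the retry budget (name assumed)."""

def send_reliably(transmit_once, max_retries=3):
    """Acknowledged-datagram send loop: retransmit until the peer
    acknowledges.  `transmit_once` is a stand-in for the real transport
    and returns True on acknowledgement, False on timeout."""
    for attempt in range(1, max_retries + 1):
        if transmit_once():
            return attempt  # number of transmissions that were needed
    raise AckTimeout("no acknowledgement after %d attempts" % max_retries)
```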

Some modules have other maintenance-network-based features in addition to these basic functions:

4. Software Architecture

Figure 2 contains high-level views of the primary system software components, their relationships to each other, and their states in a normally operating system. The figure is divided into left and right halves representing, respectively, the roles adopted by the system's two CPUs. The view corresponding to each CPU is divided again into two halves: the lower half illustrates the relationships between the major system software layers, and the upper half is a more detailed view of the processes that comprise the highest layer - the Veritas FirstWatch HA software and the application software.

4.1 Normal Operating Conditions

Normally, only one of the CPUs will be providing mail service to the system's clients. This active CPU will configure its public Ethernet interface with the "official" IP address of the mail server, and will configure the SCSI switches (using maintenance commands issued to the PSG connected to its terminal port B) to connect this CPU to both RAID controllers.

The other CPU, the standby, does not participate in providing mail service, but monitors the state of the active CPU using the redundant Ethernet HA "heartbeat" links, and is prepared to take over the role of the active CPU if that CPU should fail or voluntarily relinquish its active role. The monitoring of the CPUs by each other, and their transitions between the service-providing and non-service-providing states, are controlled by Veritas FirstWatch software running on both CPUs.
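
The heartbeat-based monitoring, and the asymmetric responses of the two CPUs described later in Section 5.5, can be sketched as follows. The function names and role encoding are illustrative, not FirstWatch's actual API.

```python
def mate_has_failed(last_heartbeat, now, timeout):
    """True when no heartbeat has arrived within `timeout` seconds."""
    return (now - last_heartbeat) > timeout

def choose_action(role, mate_failed):
    """Asymmetric response: a standby CPU whose active mate fails
    triggers a failover; an active CPU whose standby mate fails
    only logs the event."""
    if not mate_failed:
        return "none"
    return "failover" if role == "standby" else "log-only"
```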

[ IMAGE ]

Figure 2 -- System Software Components

4.2 Software Layers

The system's application and high-availability functionality both rely on a set of services provided by lower layers of software. From the lowest layer up, the software layers are:

4.3 HA and Application Software

The FirstWatch environment, illustrated in the upper half of Figure 2, consists of a set of user-level processes and scripts that run on the active and standby CPUs. This software is responsible for ensuring that the application services which the system is supposed to provide to the outside world, and the services that the application software itself requires in order to provide those services, are available at all times. If the software detects a service failure, it notifies maintenance personnel via console messages and entries in a system log, and it executes scripts which are designed to recover from the failure. It contains these components:

If an agent detects a loss of the service it is monitoring, it reports this fact to the HA daemons for logging, and performs the first-stage recovery action. Normally, the first-stage action is to execute a script which attempts to stop and restart the software component which is responsible for providing the service. For example, the agent monitoring SMTP mail service will attempt to stop and restart the sendmail daemon if it detects a loss of service. If the agent is not able to restore the service after a certain (configurable) number of attempts, it reports a service failure to the HA daemons. At this point, the HA daemons normally resort to more drastic recovery actions, such as causing a CPU failover.
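
The two-stage recovery policy described above (restart first, escalate on repeated failure) can be sketched as follows. Here `restart` stands in for an agent script, such as one that stops and restarts the sendmail daemon; the names and return values are illustrative.

```python
def recover_service(restart, max_attempts=3):
    """First-stage recovery: try to stop and restart the failing
    component up to `max_attempts` times; if the service does not
    come back, report a service failure to the HA daemons.
    `restart` returns True when the service is restored."""
    for attempt in range(1, max_attempts + 1):
        if restart():
            return ("restored", attempt)
    return ("report-failure-to-ha-daemons", max_attempts)
```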

The scripts used by each agent to test for availability of its service, the scripts used to restart a software component after a service has failed, and the scripts used to perform CPU state transitions are application-dependent and normally not provided directly as part of the FirstWatch package. Modification of existing prototype scripts, or development of new scripts, is normally required on the part of GNP Computers or the end customer.

4.4 CPU State Transitions

A CPU failover is the process of interchanging the states of the currently active and standby CPUs. This involves shutting down the application software on the current active CPU and de-commissioning its I/O interfaces, and moving all application functions to the current standby CPU, making it the new active CPU. The maintenance actions required to execute a failover are performed by scripts run by the HA daemons on each CPU. For the mail server application, there are two such scripts:
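
The ordering implied by the description above (take down the old active CPU's services before bringing them up on the new one) can be sketched as an ordered list of actions. The step names are descriptive paraphrases, not actual FirstWatch script names.

```python
def failover_steps():
    """Ordered failover actions implied by the text."""
    take_down = [
        "stop application software on the old active CPU",
        "release the official IP address",
        "switch the SCSI switches away from the old active CPU",
    ]
    take_over = [
        "switch the SCSI switches toward the new active CPU",
        "configure the official IP address",
        "start application software on the new active CPU",
    ]
    return take_down + take_over
```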

5. Component Failure and Replacement

With a complete description of the system's hardware and software components in hand, we are ready to address the main issue raised at the beginning of this document: how the system makes use of its components to provide a high-availability platform for the application. There are two aspects to consider in tackling this question. First, we must show how the system responds in the immediate aftermath of a component failure to reduce or eliminate any effect on the application. Secondly - and this is often overlooked in basic discussions of the subject - we must also address the steps that maintenance personnel must take to replace the failed component, and show that they too can be performed without interrupting service. Without this second capability, increased system downtime and increased vulnerability to catastrophic dual failures are inevitable.

5.1 Disk Failure

When one of the disks in a RAID set fails, the RAID controller may or may not be able to completely mask the failure, depending on the RAID level at which the set is configured:

Clearly, if the RAID controller is able to mask a disk failure, no other immediate recovery action is required. However, if the RAID controller is not able to mask the failure, either because it is configured at RAID level 0, or because it has sustained more than one disk failure, the CPU will no longer be able to communicate with that RAID set. At this point, the mirroring feature of Solstice DiskSuite comes into play: DiskSuite will take the affected sub-mirror devices off-line, but if the CPU can still access the other RAID set, reads and writes to the DiskSuite mirror device will proceed as normal. The DiskSuite monitoring agent will detect the failure of the sub-mirror devices and issue a warning message to the HA log.
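
The masking behavior of the DiskSuite mirror can be illustrated with a toy model: a read is satisfied by any on-line sub-mirror, and only the loss of every sub-mirror is unrecoverable. The data structures below are illustrative stand-ins, not DiskSuite's interface.

```python
def read_block(mirror, block):
    """Read from the first on-line sub-mirror.  `mirror` maps
    sub-mirror names to (online, blocks) pairs."""
    for online, blocks in mirror.values():
        if online:
            return blocks[block]
    raise IOError("all sub-mirrors off-line: unrecoverable dual failure")
```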

5.2 Disk Replacement

The failed disk can be removed and replaced without powering off any other components.1 After the drive has been replaced, it must be re-incorporated in the RAID set: if the RAID set itself failed as a result of the disk failure, the RAID controller will need to be restarted; otherwise, this operation can be performed while the RAID set is still on-line. After a failed RAID set has been restored, the corresponding DiskSuite sub-mirror devices will need to be re-synchronized with the active sub-mirrors, using DiskSuite maintenance commands executed on the active CPU.

1 - This feature is usually referred to as the hot-swap capability.

5.3 SCSI Switch or RAID Controller Failure

The failure of a SCSI switch or RAID controller is equivalent to the failure of a RAID set as described above, because the active CPU will no longer be able to access the affected RAID set. The response of the Solstice DiskSuite mirroring software will also be as described above, and will serve to mask the failure for application software.

5.4 SCSI Switch or RAID Controller Replacement

The SCSI Switch or RAID controller can be replaced without powering off any other components. After the failed component has been replaced, the affected DiskSuite sub-mirror devices will need to be re-synchronized with the active sub-mirrors, using DiskSuite maintenance commands executed on the active CPU.

5.5 CPU or Boot Disk Failure

Catastrophic failure of the active or standby CPU or their boot drives will be detected by the HA daemons running on the other CPU, when they stop receiving heartbeats from the mate CPU. If the standby CPU has failed, the HA daemons will immediately report this fact in the HA logs, but will take no further action. If the active CPU has failed, the standby CPU will cause a failover, after waiting for a configurable period of time for the active CPU to recover. As a result, the application software will resume execution on the new active CPU, and will continue to provide mail service to the outside world.

If failure of the active CPU or its boot drive is not catastrophic - for example, if a single SBus card fails - one or more of the service-monitoring agents will detect the failure, or more accurately, the effect of the failure on the service, and will report this failure to the HA daemons. Depending on the severity level associated with the agent, the HA daemons may respond to the report by causing a failover. All failures reported by agents are also recorded in the HA log.

5.6 CPU Replacement

A failed CPU module can be replaced without powering off any other components.2 In addition, because of the extra level of isolation provided by the SCSI switch modules, the failed CPU can be removed without affecting the integrity of the active CPU's connections to the RAID sets. This would not be the case, for example, if the CPUs and RAID controllers were connected to each other in the following way, as is sometimes suggested in other literature:

[ IMAGE ]

Figure 3 -- A Non-Optimal CPU-to-RAID Connection Scheme

With this scheme, if one of the CPUs is removed from the system for repair, both SCSI chains are no longer properly terminated. As a result, for the remaining CPU, all SCSI transactions on these chains are likely to fail, resulting in an unrecoverable dual failure of the DiskSuite sub-mirror devices, and complete loss of service. In contrast, because the SCSI switches isolate the two CPUs' connections to the RAID controllers, removing one CPU has no effect on the other. This is the primary motivation for introducing the SCSI switches into the reference architecture.

2 - The boot-disk replacement procedure is identical to the procedure for replacing a RAID-set disk described in Section 5.2.

5.7 Fan Failure and Replacement

The failure of any of the system's fans will be detected by the fan assembly's maintenance processor, and will result in a visible alarm on the fan assembly and an alarm notification on the maintenance network. The system can sustain failure of up to half of its fans without violating its published operating-environment specifications - that is, the system will not fail because of overheating. Each fan can be completely replaced without affecting any other system components, including the other fans.

5.8 Software Failure

The failure of system or application software is normally detected by agents running on the affected CPU, or by the HA daemons running on its mate if the failure is severe enough to cause the CPU to crash or hang. The agents or HA daemons will take the necessary recovery action, such as restarting the failed component or causing a CPU fail-over. Since most software failures in well-tested telecommunication systems are triggered by transient environmental conditions or improper configuration data, the recovery action normally has the desired effect of restoring the affected application service for an extended period.

5.9 Software Replacement and Upgrades

The ability to replace or upgrade software on a live system is one of the main advantages of a system with loosely-coupled redundant components, when compared with a tightly-coupled system in which hardware components operate in lock step. For example, in the system described here, it is possible to shut down the standby CPU, reload any or all software components using the CPU's Removable Media Module (CD-ROM or 4mm DAT), boot the CPU and execute component-level and selected application-level tests that verify the new software configuration, without interfering with the application running on the active CPU. Using the ability to cause a CPU fail-over from the FirstWatch maintenance console, it is also possible to soak the new software configuration under live traffic for a pre-determined interval, while retaining the ability to immediately fail-back to the earlier configuration if unexpected problems arise. After the new application configuration has been tested for as long as is necessary under load, the other CPU can be upgraded in the same manner as the first. The potential sticking point of compatibility between the old and new application data formats usually must be tackled anyway as part of the software upgrade plan.
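
The rolling-upgrade procedure described above can be summarized as an ordered sequence; the step wording below is a paraphrase of the text, not a formal procedure.

```python
def live_upgrade_steps():
    """The live software-upgrade sequence, as an ordered list."""
    return [
        "shut down the standby CPU",
        "reload software from the Removable Media Module",
        "boot the standby CPU and run verification tests",
        "fail over so the upgraded CPU carries live traffic",
        "soak under live traffic, ready to fail back if problems arise",
        "upgrade the other CPU in the same manner",
    ]
```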

6. Conclusion

We have described the construction and operation of a hypothetical reference system, consisting of GNP Telco WorkServer components together with off-the-shelf third-party products, that performs as a high-availability Internet mail server. We described the roles played by the system's industry-standard components and I/O interfaces in providing the capabilities needed to execute application software. We also described the role played by the WorkServer's Intelligent Maintenance Network in providing low-level monitoring and control access to all components, and how this capability can be used by HA software running on the system's CPUs, and by external systems connected to the PSGs, to manage the system. We showed how the HA software is able to monitor the services provided by lower-level system software and the application, and take the necessary recovery action if services fail, including the ability to migrate all system functions to the standby CPU. Finally, we reviewed the system's response to individual component failures, showing that it is possible for the system to mask or reconfigure around failures, and for maintenance personnel to take the necessary repair actions, without affecting the level of service provided by the mail server.

Appendix A Solstice DiskSuite

The SunSoft Solstice DiskSuite 4.0 software package provides a number of features that enhance the performance, reliability and manageability of sets of disks attached to a single computer, and simplify the use of sets of disks for storing application data. The central feature provided by DiskSuite is the ability for application software to treat a set of disk partitions as a single logical device, called a metadevice. DiskSuite software uses the following techniques to implement metadevices with better characteristics than the single-partition devices implemented by the base SunOS kernel:

Striping is similar to concatenation except that the addressing of the metadevice blocks is interlaced on the components, rather than addressed sequentially, in order to achieve higher performance. The interlace value for striping is user-definable, and can be tuned for specific read/write performance characteristics.
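
Interlaced addressing can be made concrete with a small mapping function: logical blocks are grouped into interlace-sized chunks, and successive chunks rotate across the component disks. This is a conceptual sketch, not DiskSuite's actual layout code (which operates on disk partitions).

```python
def stripe_location(block, n_disks, interlace):
    """Map a logical metadevice block to (disk, physical_block)
    under striping."""
    chunk = block // interlace            # which interlace-sized chunk
    offset = block % interlace            # position within that chunk
    disk = chunk % n_disks                # chunks rotate across the disks
    physical = (chunk // n_disks) * interlace + offset
    return disk, physical
```

With three disks and an interlace of two blocks, logical blocks 0-1 land on disk 0, blocks 2-3 on disk 1, blocks 4-5 on disk 2, and block 6 wraps back to disk 0.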

To set up mirroring, one creates a meta-mirror, which is a metadevice made up of one or more other metadevices, which are called sub-mirrors. Once a meta-mirror is defined, additional sub-mirrors can be added at a later date without bringing the system down or disrupting reads and writes to existing sub-mirrors. When a sub-mirror is attached, all the data from another sub-mirror in the meta-mirror is automatically written to the newly attached sub-mirror - this process is called resyncing.

A pseudo device, called the metatrans device, is responsible for managing the contents of the log of file system updates. Like other metadevices, the metatrans device behaves like an ordinary disk device. The metatrans device is made up of two sub-devices: the logging device and the master device. The logging device contains the log of file system updates, that is, a sequence of records, each describing a change to a file system. The master device contains an existing or a newly created UFS file system. The master device can contain an existing UFS file system because creating a metatrans device does not alter the master device. The difference in operation is that updates to the file system are written to the log before being "rolled forward" to the UFS file system. The master device is never left in an inconsistent state, and DiskSuite software can examine the transaction log after a system crash to recover file-system changes that were not committed to the master device.
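
The "roll forward" step can be illustrated with a minimal write-ahead-log model: after a crash, any logged updates not yet committed to the master are simply replayed. This is a conceptual sketch, not the metatrans device's on-disk format.

```python
def roll_forward(master, log, committed):
    """Replay logged updates not yet applied to the master device.
    `log` is a list of (key, value) records; `committed` counts how
    many of them the master already reflects."""
    for key, value in log[committed:]:
        master[key] = value
    return master
```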

Appendix B Redundant Arrays of Independent Disks (RAID)

RAID technology provides an elegant solution for reliable data storage. The basic idea of RAID is to use the combined storage capacity and I/O bandwidth of a set of disks, together with a microprocessor-based RAID controller which has I/O connectivity to all of the disks and to one or more host computers, to implement a storage device that has greater capacity, throughput and reliability than any of the component disks.
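
The redundancy used by RAID levels 4 and 5 is XOR parity: the parity block is the bitwise XOR of the data blocks, so any single lost block can be rebuilt by XOR-ing the parity with the surviving blocks. A minimal sketch:

```python
def parity(blocks):
    """Bitwise XOR over equal-length data blocks (the redundancy
    scheme used by RAID levels 4 and 5)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def reconstruct(surviving_blocks, parity_block):
    """Rebuild a single lost block: XOR the parity with the survivors."""
    return parity(list(surviving_blocks) + [parity_block])
```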


For more information about the WorkServer, High Availability, or other GNP Computers products and news, please contact us or visit our Website at http://www.gnp.com.



Copyright © 1996 GNP Computers, Inc.